Assignment_1

Author

Nicole Tang

Due Date

This assignment is due by 11:59pm Pacific Time, September 27th, 2024.

Learning Goals

  • Download, read, and get familiar with an external dataset.
  • Step through the EDA “checklist” presented in class.
  • Practice making exploratory plots.

Assignment Description

We will work with air pollution data from the U.S. Environmental Protection Agency (EPA). The EPA has a national monitoring network of air pollution sites. The primary question you will answer is whether daily concentrations of PM\(_{2.5}\) (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).

A primer on particulate matter air pollution can be found here.

Your assignment should be completed in Quarto or R Markdown.

Steps

  1. Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.
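# Load packages: data.table for fread(); dplyr, ggplot2, and leaflet for later steps
library(data.table); library(dplyr); library(ggplot2); library(leaflet)
# Read in data. NOTE: the object names are swapped relative to the files:
# air2002 holds the 2022 file and air2022 holds the 2002 file
air2002 <- fread("ad_viz_plotval_data_2022.csv")
air2022 <- fread("ad_viz_plotval_data_2002.csv")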
#observe data
dim(air2002)
[1] 59756    22
head(air2002)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 01/01/2022    AQS 60010007     3                           12.7 ug/m3 LC
2: 01/02/2022    AQS 60010007     3                           13.9 ug/m3 LC
3: 01/03/2022    AQS 60010007     3                            7.1 ug/m3 LC
4: 01/04/2022    AQS 60010007     3                            3.7 ug/m3 LC
5: 01/05/2022    AQS 60010007     3                            4.2 ug/m3 LC
6: 01/06/2022    AQS 60010007     3                            3.8 ug/m3 LC
   Daily AQI Value Local Site Name Daily Obs Count Percent Complete
             <int>          <char>           <int>            <num>
1:              58       Livermore               1              100
2:              60       Livermore               1              100
3:              39       Livermore               1              100
4:              21       Livermore               1              100
5:              23       Livermore               1              100
6:              21       Livermore               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         170
2:              88101  PM2.5 - Local Conditions         170
3:              88101  PM2.5 - Local Conditions         170
4:              88101  PM2.5 - Local Conditions         170
5:              88101  PM2.5 - Local Conditions         170
6:              88101  PM2.5 - Local Conditions         170
                     Method Description CBSA Code
                                 <char>     <int>
1: Met One BAM-1020 Mass Monitor w/VSCC     41860
2: Met One BAM-1020 Mass Monitor w/VSCC     41860
3: Met One BAM-1020 Mass Monitor w/VSCC     41860
4: Met One BAM-1020 Mass Monitor w/VSCC     41860
5: Met One BAM-1020 Mass Monitor w/VSCC     41860
6: Met One BAM-1020 Mass Monitor w/VSCC     41860
                           CBSA Name State FIPS Code      State
                              <char>           <int>     <char>
1: San Francisco-Oakland-Hayward, CA               6 California
2: San Francisco-Oakland-Hayward, CA               6 California
3: San Francisco-Oakland-Hayward, CA               6 California
4: San Francisco-Oakland-Hayward, CA               6 California
5: San Francisco-Oakland-Hayward, CA               6 California
6: San Francisco-Oakland-Hayward, CA               6 California
   County FIPS Code  County Site Latitude Site Longitude
              <int>  <char>         <num>          <num>
1:                1 Alameda      37.68753      -121.7842
2:                1 Alameda      37.68753      -121.7842
3:                1 Alameda      37.68753      -121.7842
4:                1 Alameda      37.68753      -121.7842
5:                1 Alameda      37.68753      -121.7842
6:                1 Alameda      37.68753      -121.7842
tail(air2002)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 12/01/2022    AQS 61131003     1                            3.4 ug/m3 LC
2: 12/07/2022    AQS 61131003     1                            3.8 ug/m3 LC
3: 12/13/2022    AQS 61131003     1                            6.0 ug/m3 LC
4: 12/19/2022    AQS 61131003     1                           34.8 ug/m3 LC
5: 12/25/2022    AQS 61131003     1                           23.2 ug/m3 LC
6: 12/31/2022    AQS 61131003     1                            1.0 ug/m3 LC
   Daily AQI Value      Local Site Name Daily Obs Count Percent Complete
             <int>               <char>           <int>            <num>
1:              19 Woodland-Gibson Road               1              100
2:              21 Woodland-Gibson Road               1              100
3:              33 Woodland-Gibson Road               1              100
4:              99 Woodland-Gibson Road               1              100
5:              77 Woodland-Gibson Road               1              100
6:               6 Woodland-Gibson Road               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         145
2:              88101  PM2.5 - Local Conditions         145
3:              88101  PM2.5 - Local Conditions         145
4:              88101  PM2.5 - Local Conditions         145
5:              88101  PM2.5 - Local Conditions         145
6:              88101  PM2.5 - Local Conditions         145
                                      Method Description CBSA Code
                                                  <char>     <int>
1: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
2: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
3: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
4: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
5: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
6: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                 CBSA Name State FIPS Code      State
                                    <char>           <int>     <char>
1: Sacramento--Roseville--Arden-Arcade, CA               6 California
2: Sacramento--Roseville--Arden-Arcade, CA               6 California
3: Sacramento--Roseville--Arden-Arcade, CA               6 California
4: Sacramento--Roseville--Arden-Arcade, CA               6 California
5: Sacramento--Roseville--Arden-Arcade, CA               6 California
6: Sacramento--Roseville--Arden-Arcade, CA               6 California
   County FIPS Code County Site Latitude Site Longitude
              <int> <char>         <num>          <num>
1:              113   Yolo      38.66121      -121.7327
2:              113   Yolo      38.66121      -121.7327
3:              113   Yolo      38.66121      -121.7327
4:              113   Yolo      38.66121      -121.7327
5:              113   Yolo      38.66121      -121.7327
6:              113   Yolo      38.66121      -121.7327
colnames(air2002)
 [1] "Date"                           "Source"                        
 [3] "Site ID"                        "POC"                           
 [5] "Daily Mean PM2.5 Concentration" "Units"                         
 [7] "Daily AQI Value"                "Local Site Name"               
 [9] "Daily Obs Count"                "Percent Complete"              
[11] "AQS Parameter Code"             "AQS Parameter Description"     
[13] "Method Code"                    "Method Description"            
[15] "CBSA Code"                      "CBSA Name"                     
[17] "State FIPS Code"                "State"                         
[19] "County FIPS Code"               "County"                        
[21] "Site Latitude"                  "Site Longitude"                
str(air2002)
Classes 'data.table' and 'data.frame':  59756 obs. of  22 variables:
 $ Date                          : chr  "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Daily Mean PM2.5 Concentration: num  12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily AQI Value               : int  58 60 39 21 23 21 13 38 59 55 ...
 $ Local Site Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily Obs Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS Parameter Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS Parameter Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method Code                   : int  170 170 170 170 170 170 170 170 170 170 ...
 $ Method Description            : chr  "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
 $ CBSA Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State FIPS Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County FIPS Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site Longitude                : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
summary(air2002)
     Date              Source             Site ID              POC       
 Length:59756       Length:59756       Min.   :60010007   Min.   : 1.00  
 Class :character   Class :character   1st Qu.:60290019   1st Qu.: 1.00  
 Mode  :character   Mode  :character   Median :60631006   Median : 3.00  
                                       Mean   :60563315   Mean   : 3.77  
                                       3rd Qu.:60731026   3rd Qu.: 3.00  
                                       Max.   :61131003   Max.   :24.00  
                                                                         
 Daily Mean PM2.5 Concentration    Units           Daily AQI Value 
 Min.   : -6.700                Length:59756       Min.   :  0.00  
 1st Qu.:  4.100                Class :character   1st Qu.: 23.00  
 Median :  6.800                Mode  :character   Median : 38.00  
 Mean   :  8.428                                   Mean   : 39.28  
 3rd Qu.: 10.700                                   3rd Qu.: 54.00  
 Max.   :302.500                                   Max.   :454.00  
                                                                   
 Local Site Name    Daily Obs Count Percent Complete AQS Parameter Code
 Length:59756       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88192     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88101     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS Parameter Description  Method Code  Method Description   CBSA Code    
 Length:59756              Min.   :143   Length:59756       Min.   :12540  
 Class :character          1st Qu.:170   Class :character   1st Qu.:31080  
 Mode  :character          Median :170   Mode  :character   Median :40140  
                           Mean   :336                      Mean   :34957  
                           3rd Qu.:707                      3rd Qu.:41860  
                           Max.   :810                      Max.   :49700  
                                                            NA's   :4567   
  CBSA Name         State FIPS Code    State           County FIPS Code
 Length:59756       Min.   :6       Length:59756       Min.   :  1.00  
 Class :character   1st Qu.:6       Class :character   1st Qu.: 29.00  
 Mode  :character   Median :6       Mode  :character   Median : 63.00  
                    Mean   :6                          Mean   : 56.19  
                    3rd Qu.:6                          3rd Qu.: 73.00  
                    Max.   :6                          Max.   :113.00  
                                                                       
    County          Site Latitude   Site Longitude  
 Length:59756       Min.   :32.58   Min.   :-124.2  
 Class :character   1st Qu.:34.07   1st Qu.:-121.4  
 Mode  :character   Median :36.49   Median :-119.6  
                    Mean   :36.24   Mean   :-119.6  
                    3rd Qu.:37.96   3rd Qu.:-117.9  
                    Max.   :41.76   Max.   :-115.5  
                                                    
colSums(is.na(air2002))
                          Date                         Source 
                             0                              0 
                       Site ID                            POC 
                             0                              0 
Daily Mean PM2.5 Concentration                          Units 
                             0                              0 
               Daily AQI Value                Local Site Name 
                             0                              0 
               Daily Obs Count               Percent Complete 
                             0                              0 
            AQS Parameter Code      AQS Parameter Description 
                             0                              0 
                   Method Code             Method Description 
                             0                              0 
                     CBSA Code                      CBSA Name 
                          4567                              0 
               State FIPS Code                          State 
                             0                              0 
              County FIPS Code                         County 
                             0                              0 
                 Site Latitude                 Site Longitude 
                             0                              0 

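Dimensions: 59756 rows x 22 columns. Column names (type): Date(chr), Source(chr), Site ID(int), POC(int), Daily Mean PM2.5 Concentration(num), Units(chr), Daily AQI Value(int), Local Site Name(chr), Daily Obs Count(int), Percent Complete(num), AQS Parameter Code(int), AQS Parameter Description(chr), Method Code(int), Method Description(chr), CBSA Code(int), CBSA Name(chr), State FIPS Code(int), State(chr), County FIPS Code(int), County(chr), Site Latitude(num), Site Longitude(num). Missing values: none in the key variable, Daily Mean PM2.5 Concentration; CBSA Code is the only column with NAs (4567). Two issues stand out: this object is named air2002 but holds the 2022 data (every Date shown is in 2022), and the minimum PM2.5 value is negative (-6.7), which is implausible for a concentration.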

dim(air2022)
[1] 15976    22
head(air2022)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 01/05/2002    AQS 60010007     1                           25.1 ug/m3 LC
2: 01/06/2002    AQS 60010007     1                           31.6 ug/m3 LC
3: 01/08/2002    AQS 60010007     1                           21.4 ug/m3 LC
4: 01/11/2002    AQS 60010007     1                           25.9 ug/m3 LC
5: 01/14/2002    AQS 60010007     1                           34.5 ug/m3 LC
6: 01/17/2002    AQS 60010007     1                           41.0 ug/m3 LC
   Daily AQI Value Local Site Name Daily Obs Count Percent Complete
             <int>          <char>           <int>            <num>
1:              81       Livermore               1              100
2:              93       Livermore               1              100
3:              74       Livermore               1              100
4:              82       Livermore               1              100
5:              98       Livermore               1              100
6:             115       Livermore               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         120
2:              88101  PM2.5 - Local Conditions         120
3:              88101  PM2.5 - Local Conditions         120
4:              88101  PM2.5 - Local Conditions         120
5:              88101  PM2.5 - Local Conditions         120
6:              88101  PM2.5 - Local Conditions         120
                      Method Description CBSA Code
                                  <char>     <int>
1: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
2: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
3: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
4: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
5: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
6: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
                           CBSA Name State FIPS Code      State
                              <char>           <int>     <char>
1: San Francisco-Oakland-Hayward, CA               6 California
2: San Francisco-Oakland-Hayward, CA               6 California
3: San Francisco-Oakland-Hayward, CA               6 California
4: San Francisco-Oakland-Hayward, CA               6 California
5: San Francisco-Oakland-Hayward, CA               6 California
6: San Francisco-Oakland-Hayward, CA               6 California
   County FIPS Code  County Site Latitude Site Longitude
              <int>  <char>         <num>          <num>
1:                1 Alameda      37.68753      -121.7842
2:                1 Alameda      37.68753      -121.7842
3:                1 Alameda      37.68753      -121.7842
4:                1 Alameda      37.68753      -121.7842
5:                1 Alameda      37.68753      -121.7842
6:                1 Alameda      37.68753      -121.7842
colnames(air2022)
 [1] "Date"                           "Source"                        
 [3] "Site ID"                        "POC"                           
 [5] "Daily Mean PM2.5 Concentration" "Units"                         
 [7] "Daily AQI Value"                "Local Site Name"               
 [9] "Daily Obs Count"                "Percent Complete"              
[11] "AQS Parameter Code"             "AQS Parameter Description"     
[13] "Method Code"                    "Method Description"            
[15] "CBSA Code"                      "CBSA Name"                     
[17] "State FIPS Code"                "State"                         
[19] "County FIPS Code"               "County"                        
[21] "Site Latitude"                  "Site Longitude"                
tail(air2022)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 12/10/2002    AQS 61131003     1                             15 ug/m3 LC
2: 12/13/2002    AQS 61131003     1                             15 ug/m3 LC
3: 12/22/2002    AQS 61131003     1                              1 ug/m3 LC
4: 12/25/2002    AQS 61131003     1                             23 ug/m3 LC
5: 12/28/2002    AQS 61131003     1                              5 ug/m3 LC
6: 12/31/2002    AQS 61131003     1                              6 ug/m3 LC
   Daily AQI Value      Local Site Name Daily Obs Count Percent Complete
             <int>               <char>           <int>            <num>
1:              62 Woodland-Gibson Road               1              100
2:              62 Woodland-Gibson Road               1              100
3:               6 Woodland-Gibson Road               1              100
4:              77 Woodland-Gibson Road               1              100
5:              28 Woodland-Gibson Road               1              100
6:              33 Woodland-Gibson Road               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         117
2:              88101  PM2.5 - Local Conditions         117
3:              88101  PM2.5 - Local Conditions         117
4:              88101  PM2.5 - Local Conditions         117
5:              88101  PM2.5 - Local Conditions         117
6:              88101  PM2.5 - Local Conditions         117
                      Method Description CBSA Code
                                  <char>     <int>
1: R & P Model 2000 PM2.5 Sampler w/WINS     40900
2: R & P Model 2000 PM2.5 Sampler w/WINS     40900
3: R & P Model 2000 PM2.5 Sampler w/WINS     40900
4: R & P Model 2000 PM2.5 Sampler w/WINS     40900
5: R & P Model 2000 PM2.5 Sampler w/WINS     40900
6: R & P Model 2000 PM2.5 Sampler w/WINS     40900
                                 CBSA Name State FIPS Code      State
                                    <char>           <int>     <char>
1: Sacramento--Roseville--Arden-Arcade, CA               6 California
2: Sacramento--Roseville--Arden-Arcade, CA               6 California
3: Sacramento--Roseville--Arden-Arcade, CA               6 California
4: Sacramento--Roseville--Arden-Arcade, CA               6 California
5: Sacramento--Roseville--Arden-Arcade, CA               6 California
6: Sacramento--Roseville--Arden-Arcade, CA               6 California
   County FIPS Code County Site Latitude Site Longitude
              <int> <char>         <num>          <num>
1:              113   Yolo      38.66121      -121.7327
2:              113   Yolo      38.66121      -121.7327
3:              113   Yolo      38.66121      -121.7327
4:              113   Yolo      38.66121      -121.7327
5:              113   Yolo      38.66121      -121.7327
6:              113   Yolo      38.66121      -121.7327
str(air2022)
Classes 'data.table' and 'data.frame':  15976 obs. of  22 variables:
 $ Date                          : chr  "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Daily Mean PM2.5 Concentration: num  25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily AQI Value               : int  81 93 74 82 98 115 89 62 69 107 ...
 $ Local Site Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily Obs Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS Parameter Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS Parameter Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method Code                   : int  120 120 120 120 120 120 120 120 120 120 ...
 $ Method Description            : chr  "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" ...
 $ CBSA Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State FIPS Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County FIPS Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site Longitude                : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
summary(air2022)
     Date              Source             Site ID              POC       
 Length:15976       Length:15976       Min.   :60010007   Min.   :1.000  
 Class :character   Class :character   1st Qu.:60290014   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :60590007   Median :1.000  
                                       Mean   :60549600   Mean   :1.581  
                                       3rd Qu.:60731002   3rd Qu.:1.000  
                                       Max.   :61131003   Max.   :6.000  
                                                                         
 Daily Mean PM2.5 Concentration    Units           Daily AQI Value 
 Min.   :  0.00                 Length:15976       Min.   :  0.00  
 1st Qu.:  7.00                 Class :character   1st Qu.: 39.00  
 Median : 12.00                 Mode  :character   Median : 56.00  
 Mean   : 16.12                                    Mean   : 59.28  
 3rd Qu.: 20.50                                    3rd Qu.: 72.00  
 Max.   :104.30                                    Max.   :185.00  
                                                                   
 Local Site Name    Daily Obs Count Percent Complete AQS Parameter Code
 Length:15976       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88215     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88502     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS Parameter Description  Method Code  Method Description   CBSA Code    
 Length:15976              Min.   :117   Length:15976       Min.   :12540  
 Class :character          1st Qu.:120   Class :character   1st Qu.:23420  
 Mode  :character          Median :120   Mode  :character   Median :40140  
                           Mean   :297                      Mean   :33270  
                           3rd Qu.:707                      3rd Qu.:41740  
                           Max.   :810                      Max.   :49700  
                                                            NA's   :929    
  CBSA Name         State FIPS Code    State           County FIPS Code
 Length:15976       Min.   :6       Length:15976       Min.   :  1.00  
 Class :character   1st Qu.:6       Class :character   1st Qu.: 29.00  
 Mode  :character   Median :6       Mode  :character   Median : 59.00  
                    Mean   :6                          Mean   : 54.78  
                    3rd Qu.:6                          3rd Qu.: 73.00  
                    Max.   :6                          Max.   :113.00  
                                                                       
    County          Site Latitude   Site Longitude  
 Length:15976       Min.   :32.63   Min.   :-124.2  
 Class :character   1st Qu.:34.07   1st Qu.:-121.4  
 Mode  :character   Median :35.36   Median :-119.1  
                    Mean   :36.00   Mean   :-119.4  
                    3rd Qu.:37.77   3rd Qu.:-117.9  
                    Max.   :41.71   Max.   :-115.5  
                                                    
colSums(is.na(air2022))
                          Date                         Source 
                             0                              0 
                       Site ID                            POC 
                             0                              0 
Daily Mean PM2.5 Concentration                          Units 
                             0                              0 
               Daily AQI Value                Local Site Name 
                             0                              0 
               Daily Obs Count               Percent Complete 
                             0                              0 
            AQS Parameter Code      AQS Parameter Description 
                             0                              0 
                   Method Code             Method Description 
                             0                              0 
                     CBSA Code                      CBSA Name 
                           929                              0 
               State FIPS Code                          State 
                             0                              0 
              County FIPS Code                         County 
                             0                              0 
                 Site Latitude                 Site Longitude 
                             0                              0 

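Dimensions: 15976 rows x 22 columns. Column names (type): Date(chr), Source(chr), Site ID(int), POC(int), Daily Mean PM2.5 Concentration(num), Units(chr), Daily AQI Value(int), Local Site Name(chr), Daily Obs Count(int), Percent Complete(num), AQS Parameter Code(int), AQS Parameter Description(chr), Method Code(int), Method Description(chr), CBSA Code(int), CBSA Name(chr), State FIPS Code(int), State(chr), County FIPS Code(int), County(chr), Site Latitude(num), Site Longitude(num). Missing values: none in the key variable, Daily Mean PM2.5 Concentration; CBSA Code is the only column with NAs (929). This object is named air2022 but holds the 2002 data (every Date shown is in 2002). PM2.5 ranges from 0 to 104.3, so this year has no negative values.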
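The full colSums(is.na(...)) printout is long when only one column has missing values. A minimal base-R sketch of listing just the affected columns and their share of rows follows; the `toy` data frame and its values are made up for illustration and are not the EPA data.

```r
# Toy stand-in for the EPA table (hypothetical values)
toy <- data.frame(
  pm25      = c(12.7, 13.9, 7.1, 3.7),
  cbsa_code = c(41860L, NA, 41860L, NA)
)

# Count NAs per column, then keep only columns with at least one NA
na_counts <- colSums(is.na(toy))
na_counts <- na_counts[na_counts > 0]

# Share of rows that are missing in each of those columns
na_share <- na_counts / nrow(toy)
print(na_share)
```

Applied to the real tables, the same three lines would report only CBSA Code.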

  2. Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
air <- rbind(air2002, air2022)

summary(air$Date)
   Length     Class      Mode 
    75732 character character 
air <- air %>%
  mutate(Year = year(as.Date(Date, format = "%m/%d/%Y")))

# Rename the key variables in a single rename() call
air <- air %>%
  rename(
    PM2.5 = "Daily Mean PM2.5 Concentration",
    lat   = "Site Latitude",
    long  = "Site Longitude",
    site  = "Local Site Name"
  )
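The year() helper used above comes from an add-on package (data.table or lubridate). As a sketch of the same idea in base R only, assuming the month/day/year layout seen in the files and using two made-up date strings:

```r
# Hypothetical date strings in the same "%m/%d/%Y" layout as the Date column
dates <- c("01/05/2002", "12/31/2022")

# Parse to Date; a string that does not match the format becomes NA
parsed <- as.Date(dates, format = "%m/%d/%Y")
stopifnot(!anyNA(parsed))

# Extract the four-digit year as an integer
years <- as.integer(format(parsed, "%Y"))
print(years)
```

Checking for NA after parsing is a cheap way to catch a wrong format string before the Year column is used for grouping.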
  3. Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
pal <- colorFactor(c("lightgreen", "purple"), domain = unique(air$Year))

# Create a leaflet map
leaflet(data = air) %>%
  addTiles() %>%
  addCircleMarkers(
    ~long, ~lat,  
    color = ~pal(Year),     
    radius = 5,             
    fillOpacity = 0.7 
  ) %>%
  addLegend("bottomright", pal = pal, values = ~Year,
            title = "Year",
            opacity = 1)

The map shows monitoring sites throughout California that reported data in both 2002 and 2022.
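Because the map call receives one row per daily observation, each site is drawn many times on top of itself. One possible refinement, sketched here in base R on a made-up `toy` table (site names and coordinates are hypothetical), is to reduce the data to one row per site and year first, which also makes it easy to count sites per year.

```r
# Toy stand-in: several daily rows per site (hypothetical values)
toy <- data.frame(
  site = c("Site A", "Site A", "Site B", "Site A", "Site B", "Site C"),
  lat  = c(37.7, 37.7, 38.7, 37.7, 38.7, 34.1),
  long = c(-121.8, -121.8, -121.7, -121.8, -121.7, -118.2),
  Year = c(2002, 2002, 2002, 2022, 2022, 2022),
  stringsAsFactors = FALSE
)

# Keep one row per site-year combination
sites <- unique(toy[, c("site", "lat", "long", "Year")])

# Number of distinct sites reporting in each year
sites_per_year <- table(sites$Year)
print(sites_per_year)
```

Passing a table like `sites` to leaflet() instead of the full data would draw one marker per site and year, and the per-year counts give a number to cite when summarizing the spatial distribution.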

# Summary statistics of PM2.5 by year (named pm_summary to avoid shadowing base::sum)
pm_summary <- air %>%
  group_by(Year) %>%
  summarize(
    Mean = mean(PM2.5, na.rm = TRUE),
    Median = median(PM2.5, na.rm = TRUE),
    Min = min(PM2.5, na.rm = TRUE),
    Max = max(PM2.5, na.rm = TRUE),
    Count = n()
  )
print(pm_summary)
# A tibble: 2 × 6
   Year  Mean Median   Min   Max Count
  <dbl> <dbl>  <dbl> <dbl> <dbl> <int>
1  2002 16.1    12     0    104. 15976
2  2022  8.43    6.8  -6.7  302. 59756

For 2002: mean 16.12, min 0, max 104.3, count 15976. For 2022: mean 8.43, min -6.7, max 302.5, count 59756. 2002 had a higher mean PM2.5 concentration, 2022 had a higher maximum PM2.5 value, and there are more data points from 2022.
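A larger row count can come from more sites, from more frequent sampling at each site, or both. A rough base-R sketch of separating the two, using a made-up `toy` table with hypothetical site names:

```r
# Toy stand-in: one row per daily observation (hypothetical values)
toy <- data.frame(
  site = c("Site A", "Site A", "Site A", "Site A", "Site A", "Site B", "Site B"),
  Year = c(2002, 2002, 2022, 2022, 2022, 2022, 2022),
  stringsAsFactors = FALSE
)

# Distinct sites and total observations in each year
n_sites <- tapply(toy$site, toy$Year, function(s) length(unique(s)))
n_obs   <- tapply(toy$site, toy$Year, length)

# Average number of observations contributed by each site
obs_per_site <- n_obs / n_sites
print(obs_per_site)
```

If observations per site rise between years, the difference in counts reflects sampling frequency as well as network size.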

  4. Check for any missing or implausible values of PM\(_{2.5}\) in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
sum(is.na(air$PM2.5))
[1] 0
ggplot(data = air, aes(x = as.factor(Year), y = PM2.5, fill = as.factor(Year))) + 
  geom_boxplot(outlier.colour = "lightblue") +
  labs(title = "Box Plot of PM2.5 by Year",
       x = "Year",
       y = "PM2.5 Concentration (µg/m³)")

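# Summary statistics of PM2.5 by year (named pm_summary to avoid shadowing base::sum)
pm_summary <- air %>%
  group_by(Year) %>%
  summarize(
    Mean = mean(PM2.5, na.rm = TRUE),
    Median = median(PM2.5, na.rm = TRUE),
    Min = min(PM2.5, na.rm = TRUE),
    Max = max(PM2.5, na.rm = TRUE)
  )
print(pm_summary)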
# A tibble: 2 × 5
   Year  Mean Median   Min   Max
  <dbl> <dbl>  <dbl> <dbl> <dbl>
1  2002 16.1    12     0    104.
2  2022  8.43    6.8  -6.7  302.

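There are no missing PM2.5 values. 2022 contains at least one negative PM2.5 value (minimum -6.7), which is implausible for a concentration. 2022 also has a very high maximum (302.5), but there are multiple high values, so it is possibly valid. The mean PM2.5 has roughly halved since 2002 (from 16.1 to 8.43).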
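The step asks for proportions and temporal patterns of implausible values. A minimal base-R sketch of computing the share of negative readings by year and by month is shown here; the `toy` rows are made up, and treating only negative values as implausible is an assumption.

```r
# Toy stand-in for the combined data (hypothetical values)
toy <- data.frame(
  Date  = c("01/03/2022", "01/15/2022", "06/10/2022", "06/20/2022", "02/01/2002"),
  PM2.5 = c(-1.2, 4.0, 7.5, -0.3, 12.0),
  stringsAsFactors = FALSE
)

# Derive year and month from the date string
toy$Date  <- as.Date(toy$Date, format = "%m/%d/%Y")
toy$Year  <- as.integer(format(toy$Date, "%Y"))
toy$Month <- as.integer(format(toy$Date, "%m"))

# Flag implausible (negative) concentrations
toy$negative <- toy$PM2.5 < 0

# The mean of a logical vector is the proportion of TRUE values
prop_by_year  <- tapply(toy$negative, toy$Year, mean)
prop_by_month <- aggregate(negative ~ Year + Month, data = toy, FUN = mean)
print(prop_by_year)
print(prop_by_month)
```

A table like `prop_by_month` shows whether negative readings cluster in particular months rather than being spread evenly through the year.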

  5. Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.

    • state
    • county
    • sites in Los Angeles
ggplot(air, aes(x = State, y = PM2.5, fill = as.factor(Year))) + 
  geom_boxplot(na.rm = TRUE) + 
  labs(x = "State", 
       y = "PM2.5 Concentration (µg/m³)", 
       title = "Box Plot of PM2.5 by State") + 
  theme_minimal() 

Based on this box plot, you can see that 2022 has higher maximum values while 2002 has a higher median (the center line of each box).

ggplot(air, aes(x = State, y = PM2.5, fill = as.factor(Year))) + 
  geom_violin(na.rm = TRUE) + 
  labs(x = "State", 
       y = "PM2.5 Concentration (µg/m³)", 
       title = "Violin Plot of PM2.5 by State") + 
  theme_minimal()

Based on this violin plot, you can see that 2022 has higher maximum values while 2002 has a higher typical concentration. You can also see that 2002 is more concentrated around the mean.
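The visual impressions from the box and violin plots can be backed by numbers. A small base-R sketch that computes the median, interquartile range, and maximum per year on a made-up `toy` table (values are hypothetical):

```r
# Toy stand-in for state-level data (hypothetical values)
toy <- data.frame(
  Year  = c(2002, 2002, 2002, 2002, 2022, 2022, 2022, 2022),
  PM2.5 = c(8, 12, 16, 20, 2, 6, 8, 60)
)

# One row per year; the PM2.5 column becomes a matrix of the three statistics
spread <- aggregate(
  PM2.5 ~ Year, data = toy,
  FUN = function(x) c(median = median(x), iqr = IQR(x), max = max(x))
)
print(spread)
```

In this toy example the later year has the lower median but the larger maximum, which is the same pattern described for the real data.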

ggplot(air, aes(x = County, y = PM2.5, fill = as.factor(Year))) + 
  geom_bar(stat = "summary", fun = mean, position = "dodge", na.rm = TRUE) + 
  labs(x = "County", 
       y = "Average PM2.5 Concentration (µg/m³)", 
       title = "Average PM2.5 by County") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

This bar graph shows the difference in average PM2.5 by year across counties. You can observe that 2002 has higher averages than 2022 in all counties.
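A claim about every county is easier to check in a table than in a crowded bar chart. A minimal base-R sketch that builds a county-by-year table of means and tests whether each county decreased, using a made-up `toy` table (the values are hypothetical):

```r
# Toy stand-in for county-level data (hypothetical values)
toy <- data.frame(
  County = c("Alameda", "Alameda", "Yolo", "Yolo", "Kern", "Kern"),
  Year   = c(2002, 2022, 2002, 2022, 2002, 2022),
  PM2.5  = c(20, 9, 12, 8, 18, 11),
  stringsAsFactors = FALSE
)

# Matrix of mean PM2.5 with counties as rows and years as columns
means <- tapply(toy$PM2.5, list(toy$County, toy$Year), mean)

# Change from 2002 to 2022; counties missing in either year yield NA
change <- means[, "2022"] - means[, "2002"]

# TRUE only if every county with data in both years decreased
all_decreased <- all(change < 0, na.rm = TRUE)
print(change)
print(all_decreased)
```

Sorting `change` would also show which counties improved the most and whether any county is an exception.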

losAngeles <- air %>%
  filter(County == "Los Angeles")

ggplot(losAngeles, aes(x = site, y = PM2.5, color = as.factor(Year))) + 
  stat_summary(fun = mean, geom = "point", size = 3, na.rm = TRUE) +  
  stat_summary(fun = mean, geom = "line", aes(group = Year), na.rm = TRUE) +  
  labs(x = "Site in LA", 
       y = "Average PM2.5 Concentration (µg/m³)", 
       title = "Average PM2.5 by Site in Los Angeles") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

This line graph shows the difference in average PM2.5 by year at the sites in Los Angeles. 2002 consistently has a higher average than 2022.
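A site-level comparison is cleanest when restricted to sites that reported in both years, since sites present in only one year have nothing to be compared against. A base-R sketch of that restriction follows; the `toy` table, its site names, and its values are all hypothetical.

```r
# Toy stand-in for the Los Angeles subset (hypothetical values)
toy <- data.frame(
  site  = c("Site A", "Site A", "Site B", "Site C", "Site C"),
  Year  = c(2002, 2022, 2002, 2002, 2022),
  PM2.5 = c(22, 10, 25, 19, 11),
  stringsAsFactors = FALSE
)

# Sites observed in each year, and those observed in both
sites_2002 <- unique(toy$site[toy$Year == 2002])
sites_2022 <- unique(toy$site[toy$Year == 2022])
both <- intersect(sites_2002, sites_2022)

# Mean PM2.5 per site and year, for the shared sites only
paired <- toy[toy$site %in% both, ]
site_means <- tapply(paired$PM2.5, list(paired$site, paired$Year), mean)
print(site_means)
```

Plotting only the rows in `paired` would make the two lines in the chart directly comparable at every site.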


This homework has been adapted from the case study in Roger Peng’s Exploratory Data Analysis with R